:orphan:
Core Basics 2: Train a Classifier on a Star Multi-Table Dataset
===============================================================
In this notebook we learn how to train a classifier on multi-table
data composed of two tables (a root table and a secondary table). It is
highly recommended to go through the *Core Basics 1* lesson first if you
are not familiar with Khiops.

Make sure you have installed Khiops and Khiops Visualization.

We start by importing Khiops, checking its installation and defining
some helper functions:
.. code:: ipython3

    import os
    import platform
    import subprocess
    from khiops import core as kh

    # Define peek helper function
    def peek(file_path, n=10):
        """Shows the first n lines of a file"""
        with open(file_path, encoding="utf8", errors="replace") as file:
            for line in file.readlines()[:n]:
                print(line, end="")
        print("")

    # If there are any issues you may check the Khiops status with the following command
    # kh.get_runner().print_status()

Training a Multi-Table Classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We’ll train a “sarcasm detector” using the dataset ``HeadlineSarcasm``.
In its raw form, it contains a list of text headlines, each paired with
a label that indicates whether its source is a sarcastic site (such as
The Onion) or not.
We have transformed this dataset into two tables such that the
text-label record
::
"groundbreaking study finds gratification can be deliberately postponed" yes
is transformed to an entry in a table that contains id-label records
::
97 yes
and various entries in a secondary table linking a headline id to its
words and positions
::
97 0 groundbreaking
97 1 study
97 2 finds
97 3 gratification
97 4 can
97 5 be
97 6 deliberately
97 7 postponed
Thus the ``HeadlineSarcasm`` dataset has the following multi-table
schema
::

    +-----------+
    |Headline   |
    +-----------+ +-------------+
    |HeadlineId*| |HeadlineWords|
    |IsSarcasm  | +-------------+
    +-----------+ |HeadlineId*  |
         |        |Position     |
         +-1:n--->|Word         |
                  +-------------+

The ``HeadlineId`` variable is special because it is a *key* that links
a particular headline to its words (a 1:n relation).
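
The transformation described above can be sketched in plain Python. This is only an illustration of the idea, not the script used to build the dataset: the function name ``split_into_tables`` and its ``first_id`` parameter are hypothetical, and we assume words are separated by whitespace.

```python
# Raw text-label records; this one is the example quoted in the text
raw_records = [
    ("groundbreaking study finds gratification can be deliberately postponed", "yes"),
]


def split_into_tables(records, first_id=0):
    """Builds main (id, label) rows and secondary (id, position, word) rows."""
    headline_rows = []
    word_rows = []
    for headline_id, (text, label) in enumerate(records, start=first_id):
        # One row per headline in the main table
        headline_rows.append((headline_id, label))
        # One row per word in the secondary table, sharing the same key
        for position, word in enumerate(text.split()):
            word_rows.append((headline_id, position, word))
    return headline_rows, word_rows


headline_rows, word_rows = split_into_tables(raw_records, first_id=97)
print(headline_rows)
for row in word_rows:
    print(*row)
```

Every secondary row carries the ``HeadlineId`` of its headline, which is what makes the 1:n link possible.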
*Note: There are other methods more appropriate for this text-mining
problem. This multi-table setup is only for pedagogical purposes.*
To train a classifier with Khiops in this multi-table setup, this schema
must be codified in the dictionary file. Let’s check the contents of the
``HeadlineSarcasm`` dictionary file:
.. code:: ipython3
sarcasm_kdic = os.path.join("data", "HeadlineSarcasm", "HeadlineSarcasm.kdic")
print(f"HeadlineSarcasm dictionary file: {sarcasm_kdic}")
print("")
peek(sarcasm_kdic, n=15)
.. parsed-literal::
HeadlineSarcasm dictionary file: data/HeadlineSarcasm/HeadlineSarcasm.kdic
Root Dictionary Headline(HeadlineId)
{
Categorical HeadlineId;
Categorical IsSarcasm;
Table(Words) HeadlineWords;
};
Dictionary Words(HeadlineId)
{
Categorical HeadlineId;
Numerical Position;
Categorical Word;
};
As in the single-table case, the ``.kdic`` file describes the schema for
both tables, but note the following differences:

- The dictionary for the table ``Headline`` is prefixed by the ``Root``
  keyword to indicate that it is the main one.
- For both tables, their dictionary names are followed by
  ``(HeadlineId)`` to indicate that ``HeadlineId`` is the key of these
  tables.
- The schema for the main table contains an extra special variable
  defined with the statement ``Table(Words) HeadlineWords``. This, in
  addition to sharing the same key variable, is necessary to indicate
  the ``1:n`` relationship between the main and secondary table.

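
To make these three points concrete, here is a small sketch that extracts them from the dictionary text with regular expressions. It is not a Khiops API and not a full ``.kdic`` parser: the helper ``summarize_kdic`` is hypothetical and only handles the simple layout shown above.

```python
import re

# Dictionary text as printed above
KDIC_TEXT = """
Root Dictionary Headline(HeadlineId)
{
    Categorical HeadlineId;
    Categorical IsSarcasm;
    Table(Words) HeadlineWords;
};

Dictionary Words(HeadlineId)
{
    Categorical HeadlineId;
    Numerical Position;
    Categorical Word;
};
"""


def summarize_kdic(text):
    """Returns, per dictionary, whether it is root, its key and its Table variables."""
    header = re.compile(r"^(Root\s+)?Dictionary\s+(\w+)\s*\(([^)]*)\)", re.MULTILINE)
    table_var = re.compile(r"^\s*Table\((\w+)\)\s+(\w+)\s*;", re.MULTILINE)
    headers = list(header.finditer(text))
    result = {}
    for i, match in enumerate(headers):
        # The body of a dictionary runs until the next dictionary header
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[match.end():end]
        result[match.group(2)] = {
            "root": match.group(1) is not None,
            "key": [k.strip() for k in match.group(3).split(",") if k.strip()],
            "tables": {var: target for target, var in table_var.findall(body)},
        }
    return result


summary = summarize_kdic(KDIC_TEXT)
for name, info in summary.items():
    print(name, info)
```

The output shows that ``Headline`` is the root, that both dictionaries use ``HeadlineId`` as key, and that the variable ``HeadlineWords`` points to the ``Words`` dictionary.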
Now let’s store the locations of the main and secondary tables and peek
at their contents:
.. code:: ipython3
sarcasm_headlines_file = os.path.join("data", "HeadlineSarcasm", "Headlines.txt")
sarcasm_words_file = os.path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")
print(f"HeadlineSarcasm main table file: {sarcasm_headlines_file}")
print("")
peek(sarcasm_headlines_file, n=3)
print(f"HeadlineSarcasm secondary table file location: {sarcasm_words_file}")
print("")
peek(sarcasm_words_file, n=15)
.. parsed-literal::
HeadlineSarcasm main table file: data/HeadlineSarcasm/Headlines.txt
HeadlineId IsSarcasm
0 yes
1 no
HeadlineSarcasm secondary table file location: data/HeadlineSarcasm/HeadlineWords.txt
HeadlineId Position Word
0 0 thirtysomething
0 1 scientists
0 2 unveil
0 3 doomsday
0 4 clock
0 5 of
0 6 hair
0 7 loss
1 0 dem
1 1 rep.
1 2 totally
1 3 nails
1 4 why
1 5 congress
The call to ``train_predictor`` is very similar to the single-table
case, but there are some differences.

The first is that we must pass the path of the extra secondary data
table. This is done with the ``additional_data_tables`` parameter, which
is a Python dictionary containing one key-value pair for each secondary
table. More precisely:

- keys describe the *data paths* of secondary tables. In this case only
  :literal:`Headline`HeadlineWords`
- values describe the *file paths* of secondary tables. In this case
  only the file path we stored in ``sarcasm_words_file``

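
As a minimal sketch of how such a dictionary is put together, the helper below joins the root dictionary name and the table variable name with a backtick, as in this tutorial. The function ``build_additional_data_tables`` is hypothetical (it is not part of Khiops), and we only assume the simple one-level data path used here.

```python
import os


def build_additional_data_tables(root_dictionary, table_files, separator="`"):
    """Maps each data path (root name + separator + table variable) to its file path."""
    return {
        f"{root_dictionary}{separator}{table_variable}": file_path
        for table_variable, file_path in table_files.items()
    }


sarcasm_words_file = os.path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")
additional_data_tables = build_additional_data_tables(
    "Headline", {"HeadlineWords": sarcasm_words_file}
)
print(additional_data_tables)
```

Note that the key uses the name of the ``Table`` variable (``HeadlineWords``), not the name of the secondary dictionary (``Words``).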
*Note: To understand what data paths are, see the “Multi-Table Tasks”
section of the Khiops ``core.api`` documentation.*
Secondly, we specify how many features/aggregates Khiops will create
with its multi-table AutoML mode. For the ``HeadlineSarcasm`` dataset
Khiops can create features such as:

- *Number of different words in the headline*
- *Most common word in the headline before the third one*
- *Number of times the word ‘the’ appears*
- …

It will then evaluate, select and combine the created features to build
a classifier. We’ll ask Khiops to create ``1000`` of these features (the
default is ``100``).
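
To give an intuition of what such aggregates look like, here is a hand-written sketch that computes a few of them from secondary table rows. Khiops builds and selects its features automatically, so this is not how it works internally. The rows for headline ``0`` come from the file peeked above, the rows for headline ``1`` are hypothetical, and the feature names are ours.

```python
from collections import defaultdict

word_rows = [
    (0, 0, "thirtysomething"), (0, 1, "scientists"), (0, 2, "unveil"),
    (0, 3, "doomsday"), (0, 4, "clock"), (0, 5, "of"), (0, 6, "hair"),
    (0, 7, "loss"),
    # Hypothetical headline, added only so that the aggregates differ
    (1, 0, "the"), (1, 1, "cat"), (1, 2, "and"), (1, 3, "the"), (1, 4, "hat"),
]

# Group the secondary rows by their key, as the 1:n relation implies
words_by_headline = defaultdict(list)
for headline_id, _position, word in word_rows:
    words_by_headline[headline_id].append(word)

# One row of aggregates per headline, ready to sit next to the main table
features = {
    headline_id: {
        "word_count": len(words),
        "distinct_word_count": len(set(words)),
        "count_of_the": words.count("the"),
    }
    for headline_id, words in words_by_headline.items()
}

for headline_id, values in features.items():
    print(headline_id, values)
```

Each aggregate turns the variable number of secondary rows into a single value per headline, which is what allows a classifier to use them.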
With these considerations, let’s set up some extra variables and train
the classifier:
.. code:: ipython3

    sarcasm_results_dir = os.path.join("exercises", "HeadlineSarcasm")

    sarcasm_report, sarcasm_model_kdic = kh.train_predictor(
        sarcasm_kdic,
        dictionary_name="Headline",  # This must be the main/root dictionary
        data_table_path=sarcasm_headlines_file,  # This must be the data file for the main table
        target_variable="IsSarcasm",
        results_dir=sarcasm_results_dir,
        additional_data_tables={"Headline`HeadlineWords": sarcasm_words_file},
        max_constructed_variables=1000,  # by default Khiops constructs 100 variables for AutoML multi-table
        max_trees=0,  # by default Khiops constructs 10 decision tree variables
    )
    print(f"HeadlineSarcasm report file located at: {sarcasm_report}")
    print(f"HeadlineSarcasm modeling dictionary file located at: {sarcasm_model_kdic}")

.. parsed-literal::
HeadlineSarcasm report file located at: exercises/HeadlineSarcasm/AllReports.khj
HeadlineSarcasm modeling dictionary file located at: exercises/HeadlineSarcasm/Modeling.kdic
We can now take a look at the results with the visualization tool:
.. code:: ipython3
# To visualize uncomment the line below
# kh.visualize_report(sarcasm_report)
*Note: In the multi-table case, the input tables must be sorted by their
key column in lexicographical order. To do this you may use the Khiops
``sort_data_table`` function or your favorite software. The examples of
this tutorial have their tables pre-sorted.*
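
As an illustration of what lexicographical order means for keys, the sketch below sorts a few rows by comparing their key as text. It is plain Python and does not replace ``sort_data_table``. The rows and the helper ``sort_rows_by_key`` are hypothetical.

```python
def sort_rows_by_key(rows, key_index=0):
    """Sorts rows by comparing the key column as text, character by character."""
    return sorted(rows, key=lambda row: str(row[key_index]))


# Hypothetical main table rows: (HeadlineId, IsSarcasm)
rows = [("10", "no"), ("2", "yes"), ("1", "no")]
sorted_rows = sort_rows_by_key(rows)
for row in sorted_rows:
    print(*row)
```

Here the key ``10`` comes before ``2`` because the comparison is done on characters and not on numeric values. Both the main and the secondary table must follow the same ordering.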
Exercise time!
~~~~~~~~~~~~~~
Repeat the previous steps with the ``AccidentsSummary`` dataset. It
describes the characteristics of traffic accidents that happened in
France in 2018. It has two tables with the following schema:
::

    +---------------+
    |Accidents      |
    +---------------+
    |AccidentId*    |
    |Gravity        |
    |Date           |
    |Hour           | +---------------+
    |Light          | |Vehicles       |
    |Department     | +---------------+
    |Commune        | |AccidentId*    |
    |InAgglomeration| |VehicleId*     |
    |...            | |Direction      |
    +---------------+ |Category       |
           |          |PassengerNumber|
           +---1:n--->|...            |
                      +---------------+

So for each accident we have its characteristics (such as ``Gravity`` or
``Light`` conditions) and those of each involved vehicle (its
``Direction`` or ``PassengerNumber``). The main task for this dataset is
to predict the variable ``Gravity``, which has two possible values:
``Lethal`` and ``NonLethal``.
We first save the paths of the ``AccidentsSummary`` dictionary file and
data table files into variables:
.. code:: ipython3

    accidents_kdic = os.path.join(
        kh.get_samples_dir(), "AccidentsSummary", "Accidents.kdic"
    )
    accidents_data_file = os.path.join(
        kh.get_samples_dir(), "AccidentsSummary", "Accidents.txt"
    )
    vehicles_data_file = os.path.join(
        kh.get_samples_dir(), "AccidentsSummary", "Vehicles.txt"
    )

Print the file locations and use the function ``peek`` to list their contents
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Which table is the ``Root`` in this case?
.. code:: ipython3
print(f"Accidents dictionary file: {accidents_kdic}")
print("")
peek(accidents_kdic, n=40)
print(f"Accidents (main) data table: {accidents_data_file}")
print("")
peek(accidents_data_file)
print(f"Vehicles data table: {vehicles_data_file}")
print("")
peek(vehicles_data_file)
.. parsed-literal::
Accidents dictionary file: /github/home/khiops_data/samples/AccidentsSummary/Accidents.kdic
Root Dictionary Accident(AccidentId)
{
Categorical AccidentId;
Categorical Gravity;
Date Date;
Time Hour;
Categorical Light;
Categorical Department;
Categorical Commune;
Categorical InAgglomeration;
Categorical IntersectionType;
Categorical Weather;
Categorical CollisionType;
Categorical PostalAddress;
Table(Vehicle) Vehicles;
};
Dictionary Vehicle(AccidentId, VehicleId)
{
Categorical AccidentId;
Categorical VehicleId;
Categorical Direction;
Categorical Category;
Numerical PassengerNumber;
Categorical FixedObstacle;
Categorical MobileObstacle;
Categorical ImpactPoint;
Categorical Maneuver;
};
Accidents (main) data table: /github/home/khiops_data/samples/AccidentsSummary/Accidents.txt
AccidentId Gravity Date Hour Light Department Commune InAgglomeration IntersectionType Weather CollisionType PostalAddress
201800000001 NonLethal 2018-01-24 15:05:00 Daylight 590 005 No Y-type Normal 2Vehicles-BehindVehicles-Frontal route des Ansereuilles
201800000002 NonLethal 2018-02-12 10:15:00 Daylight 590 011 Yes Square VeryGood NoCollision Place du général de Gaul
201800000003 NonLethal 2018-03-04 11:35:00 Daylight 590 477 Yes T-type Normal NoCollision Rue nationale
201800000004 NonLethal 2018-05-05 17:35:00 Daylight 590 052 Yes NoIntersection VeryGood 2Vehicles-Side 30 rue Jules Guesde
201800000005 NonLethal 2018-06-26 16:05:00 Daylight 590 477 Yes NoIntersection Normal 2Vehicles-Side 72 rue Victor Hugo
201800000006 NonLethal 2018-09-23 06:30:00 TwilightOrDawn 590 052 Yes NoIntersection LightRain Other D39
201800000007 NonLethal 2018-09-26 00:40:00 NightStreelightsOn 590 133 Yes NoIntersection Normal Other 4 route de camphin
201800000008 Lethal 2018-11-30 17:15:00 NightStreelightsOn 590 011 Yes NoIntersection Normal Other rue saint exupéry
201800000009 NonLethal 2018-02-18 15:57:00 Daylight 590 550 No NoIntersection Normal Other rue de l'égalité
Vehicles data table: /github/home/khiops_data/samples/AccidentsSummary/Vehicles.txt
AccidentId VehicleId Direction Category PassengerNumber FixedObstacle MobileObstacle ImpactPoint Maneuver
201800000001 A01 Unknown Car<=3.5T 0 None Vehicle RightFront TurnToLeft
201800000001 B01 Unknown Car<=3.5T 0 None Vehicle LeftFront NoDirectionChange
201800000002 A01 Unknown Car<=3.5T 0 None Pedestrian None NoDirectionChange
201800000003 A01 Unknown Motorbike>125cm3 0 StationaryVehicle Vehicle Front NoDirectionChange
201800000003 B01 Unknown Car<=3.5T 0 None Vehicle LeftSide TurnToLeft
201800000003 C01 Unknown Car<=3.5T 0 None None RightSide Parked
201800000004 A01 Unknown Car<=3.5T 0 None Other RightFront Avoidance
201800000004 B01 Unknown Bicycle 0 None Vehicle LeftSide None
201800000005 A01 Unknown Moped 0 None Vehicle RightFront PassLeft
We now save the results directory for this exercise:
.. code:: ipython3
accidents_results_dir = os.path.join("exercises", "AccidentSummary")
print(f"AccidentsSummary exercise results directory: {accidents_results_dir}")
.. parsed-literal::
AccidentsSummary exercise results directory: exercises/AccidentSummary
Train a classifier for the ``Accidents`` database with 1000 variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Save the resulting file locations into the variables
``accidents_report`` and ``accidents_model_kdic`` and print them.
Do not forget:

- The target variable is ``Gravity``
- The key for the ``additional_data_tables`` parameter is
  :literal:`Accident`Vehicles` and its value is ``vehicles_data_file``
- Set ``max_trees=0``

.. code:: ipython3

    accidents_report, accidents_model_kdic = kh.train_predictor(
        accidents_kdic,
        dictionary_name="Accident",
        data_table_path=accidents_data_file,
        target_variable="Gravity",
        results_dir=accidents_results_dir,
        additional_data_tables={"Accident`Vehicles": vehicles_data_file},
        max_constructed_variables=1000,
        max_trees=0,
    )
    print(f"AccidentsSummary report file: {accidents_report}")
    print(f"AccidentsSummary modeling dictionary: {accidents_model_kdic}")

.. parsed-literal::
AccidentsSummary report file: exercises/AccidentSummary/AllReports.khj
AccidentsSummary modeling dictionary: exercises/AccidentSummary/Modeling.kdic
Take a look at the report
^^^^^^^^^^^^^^^^^^^^^^^^^
Which variables best predict the gravity of an accident?
.. code:: ipython3
# To visualize uncomment the line below
# kh.visualize_report(accidents_report)